Lab 3 - Extended Logistic Regression¶

For this lab, we are considering the diamond dataset to develop a model for predicting the price of a diamond.

The dataset is a collection of approximately 54,000 observations regarding different aspects and traits of a diamond.

This dataset was sourced from Kaggle, though the data originated from Tiffany & Co.'s 2017 snapshot price list.

Team¶

Conducting this analysis is a team of three:

  1. Samina Faheem
  2. Giancarlos Dominguez
  3. Melodie Zhu

Preparation and Overview¶

Business Understanding¶

With this dataset, we hope to create a model that can accurately predict the price of a diamond when considering different aspects and traits of the gemstone. This model is meant to automate and speed up the process of determining a diamond's price from its features. The dataset includes four of the most important qualities for determining a diamond's value: color, clarity, cut, and carat weight. These factors, along with the others, help to ascertain the rarity of a diamond, and therefore its value and price. Our model will take in the diamond data and predict the price of a diamond. The different aspects of a diamond can significantly change its price range, making such a model very useful for companies involved in the diamond industry, whether they are making diamonds or selling them. Our model would be most useful in offline analysis, once all the information on a diamond has been determined and collected.

We hope that our prediction model will achieve an accuracy of 95%.

Loading the Dataset¶

In [1]:
# import necessary libraries
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import plotly
import copy
import os
In [2]:
from sklearn import metrics as mt
In [3]:
# Add name of dataset to absolute path
data_directory = os.getcwd() 

# import csv file (os.path.join keeps the path portable across operating systems)
df = pd.read_csv(os.path.join(data_directory, 'diamonds.csv'))
df.head()
Out[3]:
carat cut color clarity depth table price 'x' 'y' 'z'
0 0.23 b'Ideal' b'E' b'SI2' 61.5 55.0 326.0 3.95 3.98 2.43
1 0.21 b'Premium' b'E' b'SI1' 59.8 61.0 326.0 3.89 3.84 2.31
2 0.23 b'Good' b'E' b'VS1' 56.9 65.0 327.0 4.05 4.07 2.31
3 0.29 b'Premium' b'I' b'VS2' 62.4 58.0 334.0 4.20 4.23 2.63
4 0.31 b'Good' b'J' b'SI2' 63.3 58.0 335.0 4.34 4.35 2.75

Cleaning the Dataset¶

It should be acknowledged that there are limitations with this dataset. As we progress through cleaning the dataset, we will be able to better explain the specific limitations of the dataset.

Let's see the size of the dataset and the different datatypes of its features.

In [4]:
print(f"Number of observations in training data: {df.shape[0]}")
print(f"Number of features in training data: {df.shape[1]}\n\n")

print(df.dtypes)
Number of observations in training data: 53940
Number of features in training data: 10


carat      float64
cut         object
color       object
clarity     object
depth      float64
table      float64
price      float64
'x'        float64
'y'        float64
'z'        float64
dtype: object

There are 10 features in this dataset. The features are:

  • carat: The weight of the diamond
  • cut: How the diamond's edges have been cut
  • color: The color of the diamond
  • clarity: The measurement of the diamond's purity and rarity
  • depth (really the depth percentage): the diamond's total depth as a percentage of its average diameter
  • table: The length of a diamond's flat top
  • price: The price of a diamond after considering its attributes (Target Label)
  • x: length of diamond (mm)
  • y: width of diamond (mm)
  • z: depth of diamond (mm)

Although these characteristics are important when determining the price of a diamond, a limitation of this dataset is that it considers only these characteristics. One missing factor is the diamond's country of origin, which can influence the quality and price of a diamond.

Another limitation is that there are only roughly 54,000 diamond observations. Though this may seem like plenty, our model would benefit from additional data for training.

An additional limitation is that this dataset was taken from a U.S company (Tiffany & Co.), which may not be representative of the overall diamond market and supply.

Column Renaming¶

Next, let's rename the x, y, z, and depth columns so that the names better represent the meaning of their values, making the dataset more intuitive for readers.

For the depth column (not z, which is the physical depth), the name is misleading: the column actually holds the depth percentage of a diamond, which indicates the "depth of a diamond in relation to its width" (With Clarity). Depth percentage is calculated by dividing the total depth by the average diameter, then multiplying by 100.
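The formula above can be sanity-checked against the first row of the dataset (a small sketch; the values 3.95, 3.98, and 2.43 are copied from the df.head() output above):

```python
def depth_pct(x, y, z):
    """Depth percentage: total depth z divided by the mean of diameters x and y, times 100."""
    return z / ((x + y) / 2) * 100

# first row of the dataset: x=3.95, y=3.98, z=2.43, reported depth 61.5
print(round(depth_pct(3.95, 3.98, 2.43), 1))  # 61.3 -- close to the reported 61.5
```

The small gap comes from the stored dimensions being rounded to two decimals.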

In [5]:
df = df.rename(columns={'depth': 'depth_percentage'})

Since the x, y, and z features represent the length, width, and depth of a diamond, respectively, we will change the columns names accordingly.

In [6]:
# changing column names
df = df.rename(columns={"'x'": 'length'})
df = df.rename(columns={"'y'": 'width'})
df = df.rename(columns={"'z'": 'depth'})
In [7]:
print(df.dtypes)
carat               float64
cut                  object
color                object
clarity              object
depth_percentage    float64
table               float64
price               float64
length              float64
width               float64
depth               float64
dtype: object

Next, we will perform the actual dataset cleaning. Here we can consider additional limitations such as data quality issues and inconsistencies.

Checking for Null Values¶

We will check whether there are any null values in our dataset. Having too many null values can impact the performance of our model.

In [8]:
# check for missing value
df.isnull().sum()
Out[8]:
carat               0
cut                 0
color               0
clarity             0
depth_percentage    0
table               0
price               0
length              0
width               0
depth               0
dtype: int64

There are no null values! We won't have to worry about missing data with this dataset. However, we should check for any unusual values such as 0's.

Checking for 0 values¶

In [9]:
# check for 0s in each column
df.isin([0]).sum()
Out[9]:
carat                0
cut                  0
color                0
clarity              0
depth_percentage     0
table                0
price                0
length               8
width                7
depth               20
dtype: int64

We observe 8 zeros in the length column, 7 zeros in the width column, and 20 zeros in the depth column, while the remaining features contain no zeros. Before printing the affected rows, we first bin the continuous price column into a categorical price_range target (its ranges are documented in the Final Dataset section below), so the new label appears in the printouts.

In [10]:
# create new variable: price_range
df['price_range'] = pd.cut(df['price'],
                                 [300,900,2000,4000,6500,1e6],
                                 labels=['0','1','2','3','4'])
df.price_range.describe()

# convert price_range to numeric
df["price_range"]=df["price_range"].astype(str).astype(int)

# drop the price column
df.drop(['price'], axis=1, inplace=True)
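As a reminder of how the pd.cut call above assigns bins (a toy sketch, not the lab data):

```python
import pandas as pd

prices = pd.Series([326, 900, 901, 2000, 6501, 20000])
bins = [300, 900, 2000, 4000, 6500, 1e6]
# pd.cut uses right-closed intervals by default: (300, 900], (900, 2000], ...
ranges = pd.cut(prices, bins, labels=['0', '1', '2', '3', '4'])
print(list(ranges))  # ['0', '0', '1', '1', '4', '4']
```

Note that a boundary value like 900 falls in the lower bin because the right edge is inclusive.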
In [11]:
# now lets group with the new variable
df_grouped = df.groupby(by=['cut','price_range'])
print("number of diamonds of each cut falling in each price range:")
df_grouped.count()
number of diamonds of each cut falling in each price range:
Out[11]:
carat color clarity depth_percentage table length width depth
cut price_range
b'Fair' 0 76 76 76 76 76 76 76 76
1 312 312 312 312 312 312 312 312
2 591 591 591 591 591 591 591 591
3 344 344 344 344 344 344 344 344
4 287 287 287 287 287 287 287 287
b'Good' 0 1011 1011 1011 1011 1011 1011 1011 1011
1 825 825 825 825 825 825 825 825
2 1196 1196 1196 1196 1196 1196 1196 1196
3 1063 1063 1063 1063 1063 1063 1063 1063
4 811 811 811 811 811 811 811 811
b'Ideal' 0 5690 5690 5690 5690 5690 5690 5690 5690
1 5912 5912 5912 5912 5912 5912 5912 5912
2 3755 3755 3755 3755 3755 3755 3755 3755
3 2577 2577 2577 2577 2577 2577 2577 2577
4 3617 3617 3617 3617 3617 3617 3617 3617
b'Premium' 0 2631 2631 2631 2631 2631 2631 2631 2631
1 2767 2767 2767 2767 2767 2767 2767 2767
2 2281 2281 2281 2281 2281 2281 2281 2281
3 2809 2809 2809 2809 2809 2809 2809 2809
4 3303 3303 3303 3303 3303 3303 3303 3303
b'Very Good' 0 2981 2981 2981 2981 2981 2981 2981 2981
1 2002 2002 2002 2002 2002 2002 2002 2002
2 2531 2531 2531 2531 2531 2531 2531 2531
3 2258 2258 2258 2258 2258 2258 2258 2258
4 2310 2310 2310 2310 2310 2310 2310 2310
In [13]:
# print row index of rows with zeros in length column
df.loc[df['length'] == 0]
Out[13]:
carat cut color clarity depth_percentage table length width depth price_range
11182 1.07 b'Ideal' b'F' b'SI2' 61.6 56.0 0.0 6.62 0.0 3
11963 1.00 b'Very Good' b'H' b'VS2' 63.3 53.0 0.0 0.00 0.0 3
15951 1.14 b'Fair' b'G' b'VS1' 57.5 67.0 0.0 0.00 0.0 3
24520 1.56 b'Ideal' b'G' b'VS2' 62.2 54.0 0.0 0.00 0.0 4
26243 1.20 b'Premium' b'D' b'VVS1' 62.1 59.0 0.0 0.00 0.0 4
27429 2.25 b'Premium' b'H' b'SI2' 62.8 59.0 0.0 0.00 0.0 4
49556 0.71 b'Good' b'F' b'SI2' 64.1 60.0 0.0 0.00 0.0 2
49557 0.71 b'Good' b'F' b'SI2' 64.1 60.0 0.0 0.00 0.0 2
In [14]:
# print row index of rows with zeros in width column
df.loc[df['width'] == 0]
Out[14]:
carat cut color clarity depth_percentage table length width depth price_range
11963 1.00 b'Very Good' b'H' b'VS2' 63.3 53.0 0.0 0.0 0.0 3
15951 1.14 b'Fair' b'G' b'VS1' 57.5 67.0 0.0 0.0 0.0 3
24520 1.56 b'Ideal' b'G' b'VS2' 62.2 54.0 0.0 0.0 0.0 4
26243 1.20 b'Premium' b'D' b'VVS1' 62.1 59.0 0.0 0.0 0.0 4
27429 2.25 b'Premium' b'H' b'SI2' 62.8 59.0 0.0 0.0 0.0 4
49556 0.71 b'Good' b'F' b'SI2' 64.1 60.0 0.0 0.0 0.0 2
49557 0.71 b'Good' b'F' b'SI2' 64.1 60.0 0.0 0.0 0.0 2
In [15]:
# print row index of rows with zeros in depth column
df.loc[df['depth'] == 0]
Out[15]:
carat cut color clarity depth_percentage table length width depth price_range
2207 1.00 b'Premium' b'G' b'SI2' 59.1 59.0 6.55 6.48 0.0 2
2314 1.01 b'Premium' b'H' b'I1' 58.1 59.0 6.66 6.60 0.0 2
4791 1.10 b'Premium' b'G' b'SI2' 63.0 59.0 6.50 6.47 0.0 2
5471 1.01 b'Premium' b'F' b'SI2' 59.2 58.0 6.50 6.47 0.0 2
10167 1.50 b'Good' b'G' b'I1' 64.0 61.0 7.15 7.04 0.0 3
11182 1.07 b'Ideal' b'F' b'SI2' 61.6 56.0 0.00 6.62 0.0 3
11963 1.00 b'Very Good' b'H' b'VS2' 63.3 53.0 0.00 0.00 0.0 3
13601 1.15 b'Ideal' b'G' b'VS2' 59.2 56.0 6.88 6.83 0.0 3
15951 1.14 b'Fair' b'G' b'VS1' 57.5 67.0 0.00 0.00 0.0 3
24394 2.18 b'Premium' b'H' b'SI2' 59.4 61.0 8.49 8.45 0.0 4
24520 1.56 b'Ideal' b'G' b'VS2' 62.2 54.0 0.00 0.00 0.0 4
26123 2.25 b'Premium' b'I' b'SI1' 61.3 58.0 8.52 8.42 0.0 4
26243 1.20 b'Premium' b'D' b'VVS1' 62.1 59.0 0.00 0.00 0.0 4
27112 2.20 b'Premium' b'H' b'SI1' 61.2 59.0 8.42 8.37 0.0 4
27429 2.25 b'Premium' b'H' b'SI2' 62.8 59.0 0.00 0.00 0.0 4
27503 2.02 b'Premium' b'H' b'VS2' 62.7 53.0 8.02 7.95 0.0 4
27739 2.80 b'Good' b'G' b'SI2' 63.8 58.0 8.90 8.85 0.0 4
49556 0.71 b'Good' b'F' b'SI2' 64.1 60.0 0.00 0.00 0.0 2
49557 0.71 b'Good' b'F' b'SI2' 64.1 60.0 0.00 0.00 0.0 2
51506 1.12 b'Premium' b'G' b'I1' 60.4 59.0 6.71 6.67 0.0 2

These zeros are data quality issues: a physical diamond cannot have a length, width, or depth of 0 mm, regardless of its cut grade.

In [16]:
# total number of rows with zeros in the length, width, or depth column
# (note: this sum double-counts rows that have zeros in more than one column)
print(f"Total number of rows with zeros in length, width, and depth column: {df[df['length'] == 0].shape[0] + df[df['width'] == 0].shape[0] + df[df['depth'] == 0].shape[0]}")

# corresponding percentage of the dataset
print(f"Percentage of rows with zeros in length, width, and depth column: {(df[df['length'] == 0].shape[0] + df[df['width'] == 0].shape[0] + df[df['depth'] == 0].shape[0]) / df.shape[0] * 100}%")
Total number of rows with zeros in length, width, and depth column: 35
Percentage of rows with zeros in length, width, and depth column: 0.06488691138301816%
In [17]:
df.loc[df['length'] == 0, 'length'] = df['length'].median()
In [18]:
df.loc[df['depth'] == 0, 'depth'] = df['depth'].median()
In [19]:
df.loc[df['width'] == 0, 'width'] = df['width'].median()
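One subtlety with the three imputations above: each median is computed while the zeros are still in the column, which biases it slightly downward. A zero-excluding alternative could look like this (toy data, not the lab dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"depth": [2.4, 0.0, 3.5, 0.0, 4.0]})
# replacing 0 with NaN first keeps the zeros out of the median calculation
clean_median = toy["depth"].replace(0.0, np.nan).median()
toy.loc[toy["depth"] == 0, "depth"] = clean_median
print(clean_median)  # 3.5
```

With only ~0.06% of rows affected, the difference is negligible here, but the pattern matters on dirtier data.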

Checking for Duplicate Observations¶

Now, let's check for duplicate rows in our dataset. On Kaggle, the author mentioned that there are duplicate observations, and that the x, y, z, table, and depth columns should be disregarded when matching duplicates, since those values depend on the angle at which they were measured.

In [20]:
# drop duplicates across all columns
# (keep=False removes every copy of a duplicated row)
df.drop_duplicates(keep=False, inplace=True)

print(f"Number of observations after deleting duplicates: {df.shape[0]}")
Number of observations after deleting duplicates: 53299

Deleting the duplicates will reduce the computational power and time needed to go through this dataset.
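Note that the cell above compares all columns; to follow the author's suggestion of ignoring the measurement-dependent columns, drop_duplicates accepts a subset parameter. A toy sketch:

```python
import pandas as pd

toy = pd.DataFrame({
    "carat": [0.71, 0.71, 0.90],
    "cut": ["Good", "Good", "Ideal"],
    "length": [5.70, 5.72, 6.10],  # measurement-dependent column
})
# subset= ignores the measurement-dependent columns when matching duplicates;
# keep=False (as used above) drops every copy of a duplicated row
deduped = toy.drop_duplicates(subset=["carat", "cut"], keep=False)
print(len(deduped))  # 1
```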

Dataset Statistics¶

Now that we have cleaned the dataset, let us observe the dataset statistics.

In [21]:
# dataset statistics
df.describe()
Out[21]:
carat depth_percentage table length width depth price_range
count 53299.000000 53299.000000 53299.000000 53299.000000 53299.000000 53299.000000 53299.000000
mean 0.796621 61.747412 57.455065 5.729599 5.733324 3.538592 1.871029
std 0.472033 1.425659 2.230042 1.116808 1.138125 0.700867 1.433017
min 0.200000 43.000000 43.000000 3.730000 3.680000 1.070000 0.000000
25% 0.400000 61.000000 56.000000 4.710000 4.720000 2.910000 1.000000
50% 0.700000 61.800000 57.000000 5.700000 5.710000 3.520000 2.000000
75% 1.040000 62.500000 59.000000 6.540000 6.530000 4.030000 3.000000
max 5.010000 78.200000 95.000000 10.740000 58.900000 31.800000 4.000000

Focusing first on standard deviation, the highest lies in the table column (about 2.23), meaning its values spread farthest from their mean. The length, width, and price_range columns cluster around 1, while depth (0.70) and carat (0.47) have the lowest standard deviations, meaning their values sit close to their means. For the carat feature, this highlights another limitation of the dataset: the diamonds do not vary much in weight. Looking at the max of the carat column, the heaviest diamond is only about 5 carats, which is not representative of the overall diamond supply.

To help us with additional observations, let us visualize the feature values.

In [22]:
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns

sns.set()
sns.pairplot(df, hue = "carat", height = 2)
Out[22]:
<seaborn.axisgrid.PairGrid at 0x26fbf6aadc0>

The diagonal plots are histograms of individual features, while the off-diagonal plots are scatterplots for each pair of features. Focusing on the histograms, all except the length histogram are very narrow, meaning their values sit close to the mean. This can become a limitation: when the model is tested, a prediction is more likely to be wrong if, for example, the depth value of a test diamond falls outside the limited range of the depth feature seen in training.

There are countless observations to be made with these graphs, but we will move on with cleaning the dataset.

In [23]:
import seaborn as sns

sns.set(style="darkgrid")
f, ax = plt.subplots(figsize=(8, 8))
# numeric_only=True restricts corr() to numeric columns (avoids the pandas FutureWarning)
sns.heatmap(df.corr(numeric_only=True), annot=True)
Out[23]:
<Axes: >

From the heatmap, we can observe multiple strong positive correlations:

  • carat with depth, width, length, and price_range
  • price_range with depth, width, and length
  • length with depth and width

The strongest negative correlation is between table and depth_percentage.

We have deleted all duplicate observations and imputed the rows that had 0 values in their length, width, and depth features. However, because our dataset contains ordinal features, we must go into detail about their relationship to the value of a diamond and the different ranks within each feature. We will One-Hot Encode the color, clarity, and cut columns.
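Since these categories are ordinal, an explicit rank mapping would be an alternative to one-hot encoding; we proceed with one-hot encoding below, but a rank map could look like this (a sketch on plain strings, using the cut ranking described later):

```python
import pandas as pd

# best-to-worst cut grades mapped to integer ranks
cut_rank = {"Fair": 0, "Good": 1, "Very Good": 2, "Premium": 3, "Ideal": 4}
cuts = pd.Series(["Ideal", "Fair", "Premium"])
print(list(cuts.map(cut_rank)))  # [4, 0, 3]
```

One-hot encoding avoids imposing equal spacing between grades, at the cost of more columns.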

Color¶

According to the article "Diamond Color For Your Engagement Ring: An In-Depth Guide", the lack of color of a diamond is also considered when determining the quality of a diamond.

In [24]:
# print out unique color values
print(df['color'].unique())
["b'E'" "b'I'" "b'J'" "b'H'" "b'F'" "b'G'" "b'D'"]

The following list is the color categories ranked:

  1. D (colorless, best)
  2. E
  3. F
  4. G
  5. H
  6. I
  7. J (most color, worst)
In [25]:
# one hot encode the color feature
if "color" in df.columns:
    df = pd.get_dummies(df, columns=["color"])

# rename the color columns
df.rename(columns = {
    "color_b\'D\'": "color_D",
    "color_b\'E\'": "color_E",
    "color_b\'F\'": "color_F",
    "color_b\'G\'": "color_G",
    "color_b\'H\'": "color_H",
    "color_b\'I\'": "color_I",
    "color_b\'J\'": "color_J"
}, inplace = True)

Clarity¶

Clarity is the "measurement of purity and rarity" of the gemstone. The fewer internal flaws, known as inclusions, a diamond has, the higher clarity grade it receives (Tiffany & Co.).

In [26]:
# print out unique clarity values
print(df['clarity'].unique())
["b'SI2'" "b'SI1'" "b'VS1'" "b'VS2'" "b'VVS2'" "b'VVS1'" "b'I1'" "b'IF'"]

The following list is the clarity categories ranked:

  1. IF (Internally Flawless, best)
  2. VVS1
  3. VVS2
  4. VS1
  5. VS2
  6. SI1
  7. SI2
  8. I1 (Included, worst)
In [27]:
# one hot encode the clarity feature
if "clarity" in df.columns:
    df = pd.get_dummies(df, columns=["clarity"])

# rename the clarity columns
df.rename(columns = {
    "clarity_b\'IF\'": "clarity_IF",
    "clarity_b\'VVS1\'": "clarity_VVS1",
    "clarity_b\'VVS2\'": "clarity_VVS2",
    "clarity_b\'VS1\'": "clarity_VS1",
    "clarity_b\'VS2\'": "clarity_VS2",
    "clarity_b\'SI1\'": "clarity_SI1",
    "clarity_b\'SI2\'": "clarity_SI2",
    "clarity_b\'I1\'": "clarity_I1"
}, inplace = True)

Cut¶

A diamond's cut reflects the gemstone's "final beauty and value" (GIA). The cut of a diamond determines how much light its internal structure is able to reflect, ultimately influencing its value.

In [28]:
# print out unique cut values
print(df['cut'].unique())
["b'Ideal'" "b'Premium'" "b'Good'" "b'Very Good'" "b'Fair'"]

The following list is the cut categories ranked:

  1. Ideal (best)
  2. Premium
  3. Very Good
  4. Good
  5. Fair (worst)
In [29]:
# one hot encode the cut feature
if "cut" in df.columns:
    df = pd.get_dummies(df, columns=["cut"])

# rename the cut columns
df.rename(columns = {
    "cut_b\'Fair\'": "cut_Fair",
    "cut_b\'Good\'": "cut_Good",
    "cut_b\'Very Good\'": "cut_Very Good",
    "cut_b\'Premium\'": "cut_Premium",
    "cut_b\'Ideal\'": "cut_Ideal"
}, inplace = True)

Earlier, we also turned our target variable into a categorical feature: it would be impossible to predict the price of a diamond with exact precision, down to the last dollar, so we categorized the prices into ranges.

Final Dataset¶

After cleaning our dataset, we have a total of 27 columns:

In [30]:
# print out columns
print(df.dtypes)
carat               float64
depth_percentage    float64
table               float64
length              float64
width               float64
depth               float64
price_range           int32
color_D               uint8
color_E               uint8
color_F               uint8
color_G               uint8
color_H               uint8
color_I               uint8
color_J               uint8
clarity_I1            uint8
clarity_IF            uint8
clarity_SI1           uint8
clarity_SI2           uint8
clarity_VS1           uint8
clarity_VS2           uint8
clarity_VVS1          uint8
clarity_VVS2          uint8
cut_Fair              uint8
cut_Good              uint8
cut_Ideal             uint8
cut_Premium           uint8
cut_Very Good         uint8
dtype: object

We kept the carat, depth_percentage, table, length, width, and depth columns intact, apart from renaming.

For our target feature, originally price and now price_range, we binned the values into a nominal variable. The ranges start at $300 since the minimum price is $326.

  • 0: $300 - $900
  • 1: $901 - $2000
  • 2: $2001 - $4000
  • 3: $4001 - $6500
  • 4: $6501+

From the cut column, we created five more:

  • cut_Ideal
  • cut_Premium
  • cut_Very Good
  • cut_Good
  • cut_Fair

From the color column, we created seven more:

  • color_D
  • color_E
  • color_F
  • color_G
  • color_H
  • color_I
  • color_J

From the clarity column, we created eight more:

  • clarity_IF
  • clarity_VVS1
  • clarity_VVS2
  • clarity_VS1
  • clarity_VS2
  • clarity_SI1
  • clarity_SI2
  • clarity_I1

Splitting the Dataset¶

Now that we've prepared the features we need, and because the dataset came in a single file, we must split it into training and test sets ourselves.

In [31]:
from sklearn.model_selection import ShuffleSplit

df2 = copy.deepcopy(df)

# we want to predict the X and y data as follows:
if 'price_range' in df:
    # get target label
    y = df['price_range'].values 

    # delete the price_range column from data
    del df['price_range']

    # standardize the data    
    norm_features = ['length', 'width', 'depth', 'carat', 'depth_percentage','table']

    df[norm_features] = (df[norm_features]-df[norm_features].mean()) / df[norm_features].std()
    
    # convert to numpy array
    X = df.to_numpy()

# cross validation for model
num_cv_iterations = 3
num_instances = len(y)
cv_object = ShuffleSplit(n_splits = num_cv_iterations, test_size  = 0.2)
                         
print(cv_object)
ShuffleSplit(n_splits=3, random_state=None, test_size=0.2, train_size=None)
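For reference, the ShuffleSplit object above is consumed as an iterator of train/test index pairs; a self-contained toy sketch (random_state fixed here only for reproducibility):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X_toy = np.arange(20).reshape(10, 2)
cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
for train_idx, test_idx in cv.split(X_toy):
    # each split is an independent random shuffle: 8 train rows, 2 test rows
    print(len(train_idx), len(test_idx))  # 8 2
```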
In [32]:
plotly.offline.init_notebook_mode()

# create graph
graph1 = {'x': np.unique(y), 'y': np.bincount(y)/len(y), 'type': 'bar'}

fig = dict()
fig['data'] = [graph1]
fig['layout'] = {'title': 'Multi Class Distribution',
                'autosize':False,
                'width':400,
                'height':400}

# plot graph
plotly.offline.iplot(fig)

From the above graph, we observe that the split among the different price_range categories is not balanced. A plain 80/20 shuffle split works best when the classes are balanced; with this imbalance, a split that preserves class proportions in both partitions would be preferable.
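One remedy for the imbalance is a stratified split: scikit-learn's StratifiedShuffleSplit preserves class proportions in both partitions. A toy sketch (not applied to the lab data):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

y_toy = np.array([0] * 8 + [1] * 2)   # imbalanced toy labels (4:1)
X_toy = np.zeros((10, 1))
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train_idx, test_idx = next(sss.split(X_toy, y_toy))
# both partitions keep the original 4:1 class ratio
print(np.bincount(y_toy[test_idx]))  # [4 1]
```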

Modeling¶

Custom Model¶

Before we can implement the various custom models, we must first create a base class that will serve as the foundation for our other models.

In [33]:
from sklearn.preprocessing import StandardScaler
from sklearn import metrics as mt

# note: the numeric features were already standardized above, so this
# re-scales them (and also standardizes the one-hot columns)
X = StandardScaler().fit_transform(X)
In [34]:
from scipy.special import expit

# create base binary logistic regression class
class BinaryLogisticRegression:
    def __init__(self, eta, iterations = 20, C = 0.001):
        self.eta = eta
        self.iters = iterations
        self.C = C
        
    def __str__(self):
        if(hasattr(self,'w_')):
            return 'Binary Logistic Regression Object with coefficients:\n'+ str(self.w_)
        else:
            return 'Untrained Binary Logistic Regression Object'
        
    @staticmethod
    def _add_bias(X):
        return np.hstack((np.ones((X.shape[0],1)),X)) 
    
    @staticmethod
    def _sigmoid(theta):
        return expit(theta)
    
    # vectorized gradient calculation with regularization using L2 Norm
    def _get_gradient(self,X,y):
        ydiff = y-self.predict_proba(X,add_bias=False).ravel() 
        gradient = np.mean(X * ydiff[:,np.newaxis], axis=0) 
        
        gradient = gradient.reshape(self.w_.shape)
        gradient[1:] += -2 * self.w_[1:] * self.C
        
        return gradient
    
    def predict_proba(self,X,add_bias=True):
        # add bias term if requested
        Xb = self._add_bias(X) if add_bias else X
        return self._sigmoid(Xb @ self.w_)
    
    def predict(self,X):
        return (self.predict_proba(X)>0.5) 
    
    
    def fit(self, X, y):
         # add bias term
        Xb = self._add_bias(X)
        num_samples, num_features = Xb.shape
        
        self.w_ = np.zeros((num_features,1))

        # for as many as the max iterations
        for _ in range(self.iters):
            gradient = self._get_gradient(Xb,y)

             # multiply by learning rate 
            self.w_ += gradient*self.eta

We vectorize our Logistic Regression model because computing with whole arrays is faster than looping over samples one at a time in Python.

In [35]:
class VectorBinaryLogisticRegression(BinaryLogisticRegression):
    @staticmethod
    def _sigmoid(theta):
        return expit(theta)
    
    def _get_gradient(self,X,y):
        # get y difference
        ydiff = y-self.predict_proba(X,add_bias=False).ravel()

         # make ydiff a column vector and multiply through
        gradient = np.mean(X * ydiff[:,np.newaxis], axis=0)
        
        return gradient.reshape(self.w_.shape)
In [36]:
from scipy.optimize import fmin_bfgs
from numpy import ma

class BFGSBinaryLogisticRegression(BinaryLogisticRegression):
    @staticmethod
    def objective_function(w,X,y,C):
        g = expit(X @ w)
        
        return -np.sum(ma.log(g[y==1]))-np.sum(ma.log(1-g[y==0])) + C*sum(w**2) 

    @staticmethod
    def objective_gradient(w,X,y,C):
        g = expit(X @ w)

         # get y difference
        ydiff = y-g
        gradient = np.mean(X * ydiff[:,np.newaxis], axis=0)
        gradient = gradient.reshape(w.shape)
        gradient[1:] += -2 * w[1:] * C
        return -gradient
    
    def fit(self, X, y):
         # add bias term
        Xb = self._add_bias(X)
        num_samples, num_features = Xb.shape
        
        self.w_ = fmin_bfgs(self.objective_function, # what to optimize
                            np.zeros((num_features,1)), # starting point
                            fprime=self.objective_gradient, # gradient function
                            args=(Xb,y,self.C), # extra args for gradient and objective function
                            gtol=1e-03, # stopping criteria for gradient, |v_k|
                            maxiter=self.iters, # stopping criteria iterations
                            disp=False)
        
        self.w_ = self.w_.reshape((num_features,1))
In [37]:
class MultiClassLogisticRegression:
    def __init__(self, eta, iterations=20, 
                 C=0.0001, 
                 solver=BFGSBinaryLogisticRegression):
        self.eta = eta
        self.iters = iterations
        self.C = C
        self.solver = solver
        self.classifiers_ = []
    
    def __str__(self):
        if(hasattr(self,'w_')):
            return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_)
        else:
            return 'Untrained MultiClass Logistic Regression Object'
        
    def fit(self,X,y):
        num_samples, num_features = X.shape

         # get each unique class value
        self.unique_ = np.sort(np.unique(y))
        num_unique_classes = len(self.unique_)
        self.classifiers_ = []

         # for each unique value
        for i,yval in enumerate(self.unique_):
            y_binary = np.array(y==yval).astype(int)
            
            hblr = self.solver(eta=self.eta,iterations=self.iters,C=self.C)
            hblr.fit(X,y_binary)

            # add the trained classifier to the list
            self.classifiers_.append(hblr)
            
        self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
        
    def predict_proba(self,X):
        probs = []
        for hblr in self.classifiers_:
            # get probability for each classifier
            probs.append(hblr.predict_proba(X).reshape((len(X),1)))
        
        # make into single matrix
        return np.hstack(probs)
    
    def predict(self,X):
        return np.argmax(self.predict_proba(X),axis=1)

Steepest Ascent¶

The Steepest Ascent method is an algorithm in which, at each iteration, the model steps in the direction of steepest increase until the gradient shrinks to zero or below some set tolerance. It is useful when it is not possible to find the maximum of a function analytically, or when Newton's Method is unreliable; steepest ascent provides an iterative route to an approximate solution.

In [38]:
class LogisticRegression:
    def __init__(self, eta, iterations=20):
        self.eta = eta
        self.iters = iterations
       
    def __str__(self):
        if(hasattr(self,'w_')):
            return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_)
        else:
            return 'Untrained MultiClass Logistic Regression Object'
        
    def fit(self,X,y):
        num_samples, num_features = X.shape

         # get each unique class value
        self.unique_ = np.unique(y)
        num_unique_classes = len(self.unique_)
        self.classifiers_ = [] 

        # for each unique value
        for i,yval in enumerate(self.unique_):

            # create a binary problem
            y_binary = (y==yval)

            # train the binary classifier for this class
            blr = VectorBinaryLogisticRegression(self.eta,
                                                 self.iters)
            blr.fit(X,y_binary)

            # add the trained classifier to the list
            self.classifiers_.append(blr)
            
        # save all the weights into one matrix, separate column for each class
        self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
        
    def predict_proba(self,X):
        probs = []
        for blr in self.classifiers_:
            # get probability for each classifier
            probs.append(blr.predict_proba(X))

         # make into single matrix
        return np.hstack(probs)
    
    def predict(self,X):
        return self.unique_[np.argmax(self.predict_proba(X),axis=1)]

Binary Stochastic Gradient Descent¶

Stochastic Gradient Descent is an algorithm in which a single randomly selected sample (or a small random batch) is used at each iteration instead of the whole dataset. This avoids the computational expense of using the entire dataset as the batch at every step of the optimization. Gradient descent itself finds a local minimum of an objective function by following the gradient down the loss curve.
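The contrast with the batch gradient can be sketched directly (a self-contained toy that mirrors the single-row gradient computed in the class below):

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = rng.integers(0, 2, size=1000)
w = np.zeros(3)

# batch gradient: touches all 1000 rows per update
batch_grad = np.mean(X * (y - expit(X @ w))[:, None], axis=0)

# stochastic gradient: one randomly chosen row per update
i = int(rng.random() * len(y))
sgd_grad = X[i] * (y[i] - expit(X[i] @ w))
print(batch_grad.shape, sgd_grad.shape)  # (3,) (3,)
```

Both estimates point (in expectation) in the same direction, but the stochastic one costs a thousandth of the work per step.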

In [39]:
class StochasticLogisticRegression(BinaryLogisticRegression):
    # stochastic gradient calculation 
    def _get_gradient(self,X,y):
        # grab random instance
        idx = int(np.random.rand()*len(y)) 

        # get scalar y difference
        ydiff = y[idx]-self.predict_proba(X[idx],add_bias=False)

        # make ydiff a column vector and multiply through
        gradient = X[idx] * ydiff[:,np.newaxis]
        
        gradient = gradient.reshape(self.w_.shape)
        gradient[1:] += -2 * self.w_[1:] * self.C
        
        return gradient

Newton's Method¶

Newton’s method is an algorithm that finds the root of a function, rather than its maxima or minima; for optimization it is applied to the derivative of the cost function rather than the cost function itself. Compared to gradient descent, Newton’s method can only be applied when the cost function is twice differentiable, not just once.
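A tiny toy example of the Newton update w ← w − f′(w)/f″(w) on a quadratic, where it converges in one step:

```python
# one Newton step on the toy objective f(w) = (w - 3)**2:
# first derivative 2*(w - 3), second derivative 2
w = 0.0
grad = 2 * (w - 3)   # -6
hess = 2.0
w = w - grad / hess
print(w)  # 3.0 -- Newton converges in a single step on a quadratic
```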

In [40]:
from numpy.linalg import pinv
class HessianBinaryLogisticRegression(BinaryLogisticRegression):
    # just overwrite gradient function
    def _get_gradient(self,X,y):
        # get sigmoid value for all samples
        g = self.predict_proba(X,add_bias=False).ravel()

        # calculate the hessian (the L2 term contributes -2C on the diagonal only)
        hessian = X.T @ np.diag(g*(1-g)) @ X - 2 * self.C * np.eye(X.shape[1])

        # get y difference
        ydiff = y-g

        # make ydiff a column vector and multiply through
        gradient = np.sum(X * ydiff[:,np.newaxis], axis=0) 
        gradient = gradient.reshape(self.w_.shape)
        gradient[1:] += -2 * self.w_[1:] * self.C
        
        return pinv(hessian) @ gradient
    

L2 Regularization¶

Overfitting occurs when the learned hypothesis fits the training data so closely that it hurts the model’s performance on new, unseen instances. Our gradient descent implementation has no regularization, so it is likely to overfit the training data.

We use L2 regularization to help control overfitting in our model. This method encourages the weights to be small in magnitude without forcing them to be exactly zero. Keeping the weights small prevents the sigmoid from developing such a steep slope that it becomes nearly vertical. The regularization term we add to the loss function is the sum of squares of all the feature weights; penalizing it is equivalent to using smaller multipliers inside the sigmoid function.
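The shrinkage effect can be demonstrated on toy data (a standalone sketch with illustrative names; the `-2*C*w` term matches the one the class below adds to its gradient):

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(500, 2))
y_toy = (X_toy[:, 0] - X_toy[:, 1] > 0).astype(float)

def fit(C, eta=0.5, iters=2000):
    # gradient ascent with the same L2 term used below: -2*C*w
    w = np.zeros(2)
    for _ in range(iters):
        grad = np.mean(X_toy * (y_toy - expit(X_toy @ w))[:, np.newaxis], axis=0)
        grad += -2.0 * C * w
        w += eta * grad
    return w

w_free = fit(C=0.0)   # separable data: weights keep growing
w_reg = fit(C=0.1)    # penalty caps the magnitude, flattening the sigmoid
print(np.linalg.norm(w_free), np.linalg.norm(w_reg))
```

On linearly separable data the unpenalized weights grow without bound, while the penalized weights settle at a finite magnitude where the likelihood gradient balances the `2*C*w` pull toward zero.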

In [41]:
class RegularizedBinaryLogisticRegression(VectorBinaryLogisticRegression):
    # extend init functions
    def __init__(self, C=0.0, **kwds):        
        # need to add to the original initializer 
        self.C = C
        # but keep other keywords
        super().__init__(**kwds) # call parent initializer
        
        
    # extend previous class to change functionality
    def _get_gradient(self,X,y):
        # call get gradient from previous class
        gradient = super()._get_gradient(X,y)
        
        # add in regularization (to all except bias term)
        gradient[1:] += -2 * self.w_[1:] * self.C
        return gradient
In [42]:
class RegularizedLogisticRegression(LogisticRegression):
    def __init__(self, C=0.0, **kwds):        
        # need to add to the original initializer 
        self.C = C

        super().__init__(**kwds)
        
    def fit(self,X,y):
        num_samples, num_features = X.shape

         # get each unique class value
        self.unique_ = np.unique(y)
        num_unique_classes = len(self.unique_)
        self.classifiers_ = []


        # for each unique value
        for i,yval in enumerate(self.unique_): 
            # create a binary problem
            y_binary = y==yval 

            blr = RegularizedBinaryLogisticRegression(eta=self.eta,
                                                      iterations=self.iters,
                                                      C=self.C)
            blr.fit(X,y_binary)
            
            # add the trained classifier to the list
            self.classifiers_.append(blr)
            
        # save all the weights into one matrix, separate column for each class
        self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T

Comparing Custom Models¶

In [43]:
%%time
# Steepest Descent
lr = LogisticRegression(0.1,1500)
lr.fit(X,y)

yhat1 = lr.predict(X)
print(f'Accuracy of Model with Steepest Ascent: {mt.accuracy_score(y,yhat1) * 100}%')

# plot heatmap of confusion matrix
class_names = df2.price_range.unique()
fig,ax  = plt.subplots(figsize=(10,10))

tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

# create heatmap
sns.heatmap(pd.DataFrame(mt.confusion_matrix(y,yhat1)), annot = True, cmap = "YlGnBu" ,fmt = 'g')

plt.tight_layout()
plt.title('Confusion matrix on Correct and Incorrect Predictions of the Steepest Descent Model', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

plt.show()
Accuracy of Model with Steepest Ascent: 75.27345728812924%
CPU times: total: 57.2 s
Wall time: 49.2 s
In [44]:
%%time
# Binary Stochastic Gradient Descent
slr = MultiClassLogisticRegression(eta=0.05, iterations=2500, C=0.0001, solver = StochasticLogisticRegression)
slr.fit(X,y)

yhat2 = slr.predict(X)
print(f'Accuracy of Model with Stochastic Gradient Descent: {mt.accuracy_score(y,yhat2) * 100}%')

# plot heatmap of confusion matrix
class_names = df2.price_range.unique()
fig,ax  = plt.subplots(figsize=(10,10))

tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

# create heatmap
sns.heatmap(pd.DataFrame(mt.confusion_matrix(y,yhat2)), annot = True, cmap = "YlGnBu" ,fmt = 'g')

plt.tight_layout()
plt.title('Confusion matrix on Correct and Incorrect Predictions of the Stochastic Gradient Descent Model', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

plt.show()
Accuracy of Model with Stochastic Gradient Descent: 67.00500947484943%
CPU times: total: 438 ms
Wall time: 431 ms
In [45]:
%%time
# Newton's Method
hlr = HessianBinaryLogisticRegression(eta=0.1, iterations=4, C=0.01) 
hlr.fit(X,y)

yhat3 = hlr.predict(X)
print(f'Accuracy of Model with Newton\'s Method: {mt.accuracy_score(y,yhat3) * 100}%')

# plot heatmap of confusion matrix
class_names = df2.price_range.unique()
fig,ax  = plt.subplots(figsize=(10,10))

tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

# create heatmap
sns.heatmap(pd.DataFrame(mt.confusion_matrix(y,yhat3)), annot = True, cmap = "YlGnBu" ,fmt = 'g')

plt.tight_layout()
plt.title('Confusion matrix on Correct and Incorrect Predictions of Newton\'s Method Model', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

plt.show()
Accuracy of Model with Newton's Method: 17.390570179553087%
CPU times: total: 7min 7s
Wall time: 26.7 s

Among our custom models, the steepest ascent model performed best at roughly 75% accuracy, well ahead of stochastic gradient descent at about 67% and Newton's method at about 17%. Therefore, we will move forward with our steepest ascent model.

Scikit Learn Model¶

In order to determine whether the accuracy of our custom model is good, we need a standard model to compare it against. We'll use the scikit-learn logistic regression model for this.

In [46]:
# import the necessary libraries
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

logModel = LogisticRegression(random_state=16)

# fit the model
logModel.fit(X, y)

# make predictions
yhat4 = logModel.predict(X)

# print accuracy percentage of model
print("Accuracy: ", metrics.accuracy_score(y, yhat4) * 100,"%")
Accuracy:  89.84221092328187 %
c:\users\ibrahim\code\saminas_homework\machine_learning_imag_processing\.venv\image_processing\lib\site-packages\sklearn\linear_model\_logistic.py:458: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
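The warning above suggests its own remedy: standardize the features and/or raise `max_iter`. A possible sketch of that fix, not run in this lab (it keeps the same `X`, `y`, and `random_state` as above, and uses the standard scikit-learn `StandardScaler` and `Pipeline` utilities):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scale features to zero mean / unit variance, then give lbfgs a larger
# iteration budget so it can converge without warnings
scaled_model = make_pipeline(StandardScaler(),
                             LogisticRegression(random_state=16, max_iter=1000))
# scaled_model.fit(X, y) and scaled_model.predict(X) then work exactly
# like the bare model used above
```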

In [47]:
# import the necessary libraries
from sklearn import metrics
import seaborn as sns
import matplotlib.pyplot as plt

# print the confusion matrix
c_matrix= metrics.confusion_matrix(y, yhat4)

class_names = df2.price_range.unique()
fig,ax  = plt.subplots(figsize=(10,10))

tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

# create heatmap
sns.heatmap(pd.DataFrame(c_matrix), annot = True, cmap = "YlGnBu" ,fmt = 'g')

plt.tight_layout()
plt.title('Confusion matrix on Correct and Incorrect Predictions of the Scikit Model', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

plt.show()

Model Comparison¶

In [48]:
# print accuracy of steepest ascent model
print("Accuracy: ", metrics.accuracy_score(y, yhat1) * 100,"%")

# print accuracy of scikit model
print("Accuracy: ", metrics.accuracy_score(y, yhat4) * 100,"%")
Accuracy:  75.27345728812924 %
Accuracy:  89.84221092328187 %

Comparing the results of both models, the scikit model outperforms our custom model by a significant margin. Not only is it more accurate, it is also significantly faster.

We will use heatmaps of their performance to analyze further details.

In [49]:
# print the confusion matrix
class_names = df2.price_range.unique()
fig,ax  = plt.subplots(figsize=(20,10))

tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

plt.subplot(1,2,1)
# create heatmap
sns.heatmap(pd.DataFrame(mt.confusion_matrix(y,yhat1)), annot = True, cmap = "YlGnBu" ,fmt = 'g')

plt.tight_layout()
plt.title('Confusion matrix on Correct and Incorrect Predictions of the Steepest Descent Model', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

plt.subplot(1,2,2)
# create heatmap
sns.heatmap(pd.DataFrame(mt.confusion_matrix(y,yhat4)), annot = True, cmap = "YlGnBu" ,fmt = 'g')

plt.tight_layout()
plt.title('Confusion matrix on Correct and Incorrect Predictions of the Scikit Model', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

plt.show()
C:\Users\Ibrahim\AppData\Local\Temp\ipykernel_5788\3598540272.py:9: MatplotlibDeprecationWarning:

Auto-removal of overlapping axes is deprecated since 3.6 and will be removed two minor releases later; explicitly call ax.remove() as needed.

When it came to making correct predictions on Label 0 and Label 4 diamonds, our model outperformed the scikit model. However, the scikit model did better at correctly predicting the other classes. Additionally, our model made significantly more wrong predictions in the cells adjacent to the diagonal than the scikit model did (170 vs 469, 1322 vs 549).

Deployment¶

We recommend the scikit model over our own models. Not only is the scikit model more accurate, it is also much faster. Since our model was written from scratch, we likely missed optimization opportunities that the extensively developed and maintained scikit-learn library has incorporated.

Graduate Analysis¶


BFGS Method for Multiclass Logistic Regression(Quasi-Newton Method)¶

The BFGS algorithm is a quasi-Newton method for optimization problems. Rather than computing the Hessian directly, it builds an approximation to the inverse Hessian from successive gradient evaluations, and uses that curvature information to choose the direction and size of the next step. This can be more efficient than methods that use only the gradient of the objective function, because it takes the curvature of the function into account.
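The same idea can be seen with SciPy's off-the-shelf BFGS optimizer on a toy logistic loss (a standalone sketch; `X_demo`, `y_demo`, and `neg_log_likelihood` are illustrative names, not the lab's classes):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 3))
y_demo = (X_demo @ np.array([1.0, -1.0, 0.5]) > 0).astype(float)

def neg_log_likelihood(w):
    p = expit(X_demo @ w)
    eps = 1e-12                       # guard against log(0)
    return -np.mean(y_demo * np.log(p + eps)
                    + (1 - y_demo) * np.log(1 - p + eps))

# BFGS accumulates an inverse-Hessian approximation from successive
# gradient evaluations, so no explicit Hessian needs to be coded
result = minimize(neg_log_likelihood, x0=np.zeros(3), method='BFGS')
w_opt = result.x
```

Minimizing the negative log-likelihood here is equivalent to maximizing the log-likelihood that our custom classes optimize.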

In [54]:
bfgslr = BFGSBinaryLogisticRegression(eta=1, iterations = 2, C = 0.001) # note that we need only a few iterations here; eta=1 matches the other BFGS calls
bfgslr.fit(X,y)

yhat4 = bfgslr.predict(X)
print('Accuracy of: ',mt.accuracy_score(y,yhat4))
Accuracy of:  0.24914163492748456

In the context of logistic regression, we can use BFGS to optimize the objective function that we want to maximize/minimize. The objective function in logistic regression is the log-likelihood function, and we want to maximize it to find the optimal weights that separate the classes. We can use BFGS to find the optimal weights by iteratively updating the weights based on the negative gradient and the approximated Hessian matrix until convergence.

In this case, the BFGS method is used to optimize the parameters of a logistic regression model on the diamond features. Specifically, the BFGSBinaryLogisticRegression class trains a one-vs-rest binary classifier for each unique value of the target variable (each price-range class), and the MultiClassLogisticRegression class combines these binary classifiers to make multiclass predictions.

In [55]:
# allow the user to specify the algorithm used to solve the binary case
class MultiClassLogisticRegression:
    def __init__(self, eta, iterations=20, 
                 C=0.0001, 
                 solver=BFGSBinaryLogisticRegression):
        self.eta = eta
        self.iters = iterations
        self.C = C
        self.solver = solver
        self.classifiers_ = []
        # internally we will store the weights as self.w_ to keep with sklearn conventions
    
    def __str__(self):
        if(hasattr(self,'w_')):
            return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_) # if we have trained the object
        else:
            return 'Untrained MultiClass Logistic Regression Object'
        
    def fit(self,X,y):
        num_samples, num_features = X.shape
        self.unique_ = np.sort(np.unique(y)) # get each unique class value
        num_unique_classes = len(self.unique_)
        self.classifiers_ = []
        for i,yval in enumerate(self.unique_): # for each unique value
            y_binary = np.array(y==yval).astype(int) # create a binary problem
            # train the binary classifier for this class
            
            hblr = self.solver(eta=self.eta,iterations=self.iters,C=self.C)
            hblr.fit(X,y_binary)

            # add the trained classifier to the list
            self.classifiers_.append(hblr)
            
        # save all the weights into one matrix, separate column for each class
        self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
        
    def predict_proba(self,X):
        probs = []
        for hblr in self.classifiers_:
            probs.append(hblr.predict_proba(X).reshape((len(X),1))) # get probability for each classifier
        
        return np.hstack(probs) # make into single matrix
    
    def predict(self,X):
        return np.argmax(self.predict_proba(X),axis=1) # take argmax along row
In [56]:
# run logistic regression and vary some parameters
from sklearn import metrics as mt

# first we create a reusable logistic regression object
#   here we can setup the object with different learning parameters and constants
lr_clf = MultiClassLogisticRegression(eta=1,
                                  iterations=500,
                                  C=0.0001,
                                  solver=BFGSBinaryLogisticRegression
                                 )
# now we can use the cv_object that we setup before to iterate through the 
#    different training and testing sets. Each time we will reuse the logistic regression 
#    object, but it gets trained on different data each time we use it.

iter_num=0
# the indices are the rows used for training and testing in each iteration
for train_indices, test_indices in cv_object.split(X,y): 
    # I will create new variables here so that it is more obvious what 
    # the code is doing (you can compact this syntax and avoid duplicating memory,
    # but it makes this code less readable)
    X_train = X[train_indices]
    y_train = y[train_indices]
    
    X_test = X[test_indices]
    y_test = y[test_indices]
    
    # train the reusable logistic regression model on the training data
    lr_clf.fit(X_train,y_train)  # train object
    y_hat = lr_clf.predict(X_test) # get test set predictions

    # now let's get the accuracy and confusion matrix for this iterations of training/testing
    acc = mt.accuracy_score(y_test,y_hat)
    conf = mt.confusion_matrix(y_test,y_hat)
    print("====Iteration",iter_num," ====")
    print("accuracy", acc )
    print("confusion matrix\n",conf)
    iter_num+=1
    
# Also note that every time you run the above code
#   it randomly creates a new training and testing set, 
#   so accuracy will be different each time
====Iteration 0  ====
accuracy 0.8273921200750469
confusion matrix
 [[2289  160    1    0    0]
 [ 319 1781  249    0    0]
 [   0  307 1436  300    0]
 [   1    6  236 1365  193]
 [   0    0    1   67 1949]]
====Iteration 1  ====
accuracy 0.8227954971857411
confusion matrix
 [[2285  158    1    0    0]
 [ 355 1738  270    0    0]
 [   0  308 1405  308    0]
 [   0    7  208 1401  200]
 [   0    0    4   70 1942]]
====Iteration 2  ====
accuracy 0.8314258911819887
confusion matrix
 [[2336  156    0    0    0]
 [ 305 1724  262    0    0]
 [   2  316 1447  283    0]
 [   0    7  216 1365  186]
 [   0    0    4   60 1991]]
In [57]:
from ipywidgets import widgets as wd

num_cv_iterations = 10
num_instances = len(y)
cv_object = ShuffleSplit(n_splits=num_cv_iterations,
                         test_size  = 0.2)

def lr_explor(cost):
    print('Running')
    lr_clf = MultiClassLogisticRegression(eta=1,
                                  iterations=500,
                                  C=float(cost),
                                  solver=BFGSBinaryLogisticRegression
                                 )
    acc = []
    for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X,y)):
        lr_clf.fit(X[train_indices],y[train_indices])  # train object
        y_hat = lr_clf.predict(X[test_indices]) # get test set predictions
        acc.append(mt.accuracy_score(y[test_indices],y_hat))
        
    acc = np.array(acc)
    print(acc.mean(),'+-',2.7*acc.std())
        
wd.interact(lr_explor,cost=list(np.logspace(-4,1,15)),__manual=True)
interactive(children=(Dropdown(description='cost', options=(0.0001, 0.00022758459260747887, 0.0005179474679231…
Out[57]:
<function __main__.lr_explor(cost)>

We don't see evidence of data snooping, thanks to the even distribution of classes across splits. We expect that if we collected another 20 percent of data, the chosen C and optimization settings would work well on the new set. Data snooping is analogous to overfitting, and an even class distribution helps alleviate both.

Visualization of BFGS¶

In [58]:
%%time

# alternatively, we can also graph out the values using boxplots
num_cv_iterations = 10
num_instances = len(y)
cv_object = ShuffleSplit(n_splits=num_cv_iterations,
                         test_size  = 0.2)

def lr_explor(cost):
    lr_clf = MultiClassLogisticRegression(eta=1,
                                  iterations=500,
                                  C=float(cost),
                                  solver=BFGSBinaryLogisticRegression
                                 )
    acc = []
    for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X,y)):
        lr_clf.fit(X[train_indices],y[train_indices])  # train object
        y_hat = lr_clf.predict(X[test_indices]) # get test set predictions
        acc.append(mt.accuracy_score(y[test_indices],y_hat))
        
    acc = np.array(acc)
    return acc

costs = np.logspace(-5,1,20)
accs = []
for c in costs:
    accs.append(lr_explor(c))
CPU times: total: 6min 37s
Wall time: 5min 31s
In [59]:
# now show a boxplot of the data across c
from matplotlib import pyplot as plt
%matplotlib inline

plt.boxplot(accs)
plt.xticks(range(1,len(costs)+1),['%.4f'%(c) for c in costs],rotation='vertical')
plt.xlabel('C')
plt.ylabel('validation accuracy')
plt.show()
In [60]:
%%time
lr = MultiClassLogisticRegression(eta=1,
                                  iterations=500,
                                  C=0.001,
                                  solver=BFGSBinaryLogisticRegression
                                 )
lr.fit(X,y)
print(lr)

yhat = lr.predict(X)
print('Accuracy of: ',mt.accuracy_score(y,yhat)) # use the metrics alias imported earlier
MultiClass Logistic Regression Object with coefficients:
[[-7.43513578e+00 -1.30449904e+00  1.82580447e-02 -1.81746589e-01
  -2.64026020e+00 -2.50768781e+00 -2.55068645e+00 -4.73933978e-01
  -3.79413903e-01 -2.16055418e-01 -2.74123176e-02  3.60969134e-01
   5.66155500e-01  4.38001620e-01  2.75640245e-01 -5.36650104e-01
   4.08579709e-01  5.73903426e-01 -1.96701761e-01  9.22558919e-04
  -5.60397354e-01 -3.95241951e-01 -1.50433423e-01  1.48452725e-01
  -9.61263084e-02 -1.57340966e-01  2.36215517e-01]
 [-1.99941906e+00 -4.07153588e+00 -3.15012236e-02  7.18268896e-03
   9.20573324e-01  4.40829759e-01  9.14292689e-01  9.27647179e-02
   8.55000267e-02  1.89618755e-02  2.68520743e-02 -1.28301612e-01
  -1.51520210e-01  4.40133747e-02  1.02730835e-01  1.10235723e-01
  -4.79696972e-02 -1.10337441e-01  9.13837653e-03  3.11201732e-02
   8.96042602e-02 -2.95465082e-02  9.60909389e-02 -4.31997847e-02
   5.19241627e-02  4.90909926e-02 -1.21510945e-01]
 [-1.72683155e+00 -4.99047880e+00  1.14335731e-01  1.69083301e-01
   1.70209867e+00  1.55883990e+00  1.55502148e+00 -2.40995100e-02
   6.37551723e-03  4.98727510e-02 -6.78023038e-02 -4.22641866e-02
   7.55839034e-02  2.98906533e-02  1.43113579e-01 -5.23278017e-02
  -2.64694528e-02  1.59819927e-01 -2.83810422e-02 -1.03747999e-01
  -4.43115014e-02  3.03153924e-02  8.00034953e-02  3.16831773e-02
   6.74306579e-03 -8.25780755e-02  2.39927976e-02]
 [-2.46600858e+00 -3.70381977e+00  1.50585276e-01  1.04001245e-01
   1.80472273e+00  1.52193469e+00  1.63054987e+00  3.38299308e-02
  -2.41419179e-02 -2.96648767e-02 -3.72413018e-02  1.03436161e-01
  -2.86532499e-02 -1.83425631e-02 -4.16141092e-02 -4.78540331e-01
   3.50933205e-01  2.22325428e-01 -5.00396510e-02  2.77472900e-02
  -2.13657179e-01 -2.85254650e-01 -1.92607839e-02  7.21310770e-02
  -9.59292798e-02  1.71178337e-02  5.30153626e-02]
 [-6.25907170e+00  2.48069909e+00 -2.49208932e-02 -7.21848750e-02
   1.73291457e+00  1.69324600e+00  1.64840286e+00  3.30656499e-01
   3.34569222e-01  3.00556510e-01  1.73881675e-01 -3.02006099e-01
  -5.42128242e-01 -6.88151437e-01 -8.08599163e-01  5.72484245e-01
  -6.71072951e-01 -1.16572076e+00  5.82361574e-01  2.31802741e-01
   6.42714046e-01  8.52669161e-01 -2.45161702e-01 -1.12645611e-01
   1.63526432e-01 -5.31788752e-02  4.05805388e-02]]
In [61]:
from matplotlib import pyplot as plt
import copy
%matplotlib inline
plt.style.use('ggplot')

def plot_decision_boundaries(lr,Xin,y,title=''):
    Xb = copy.deepcopy(Xin)
    lr.fit(Xb[:,:2],y) # train only on two features

    h=0.01
    # create a mesh to plot in
    x_min, x_max = Xb[:, 0].min() - 1, Xb[:, 0].max() + 1
    y_min, y_max = Xb[:, 1].min() - 1, Xb[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    # get prediction values
    Z = lr.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.5)

    # Plot also the training points
    plt.scatter(Xb[:, 0], Xb[:, 1], c=y, cmap=plt.cm.Paired)
    plt.xlabel('Independent Value')
    plt.ylabel('Dependent Value')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(title)
    plt.show()
    
lr = BFGSBinaryLogisticRegression(0.1,1500) # this is still OUR LR implementation, not sklearn
plot_decision_boundaries(lr,X,y)

Mean Square Error¶

In [62]:
params = dict(eta=0.01, iterations=500)
In [63]:
class MSEVectorBinaryLogisticRegression(BinaryLogisticRegression):
    # inherit from our previous class to get same functionality
    @staticmethod
    def _sigmoid(theta):
        # increase stability, redefine sigmoid operation
        return expit(theta) #1/(1+np.exp(-theta))
    # but overwrite the gradient calculation
    def _get_gradient(self,X,y):
        p = self.predict_proba(X,add_bias=False).ravel()
        ydiff = (y - p) * p * (1 - p) # MSE chain rule: (y - sigma)*sigma*(1 - sigma)
        gradient = np.mean(X * ydiff[:,np.newaxis], axis=0) # make ydiff a column vector and multiply through
        return gradient.reshape(self.w_.shape)
    
# use same params as defined above
MSEblr = MSEVectorBinaryLogisticRegression(**params)
MSEblr.fit(X,y)
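As a standalone sanity check (toy data with illustrative names, independent of the classes above), the gradient-ascent direction for the MSE loss, proportional to mean((y − σ)·σ·(1 − σ)·x), can be verified against a central finite-difference approximation of −MSE:

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(3)
X_chk = rng.normal(size=(50, 4))
y_chk = rng.integers(0, 2, size=50).astype(float)
w0 = rng.normal(size=4)

def mse(w):
    return np.mean((y_chk - expit(X_chk @ w)) ** 2)

# analytic ascent direction: 2 * mean((y - sigma) * sigma * (1 - sigma) * x)
p = expit(X_chk @ w0)
analytic = 2 * np.mean(X_chk * ((y_chk - p) * p * (1 - p))[:, np.newaxis], axis=0)

# central finite differences of -MSE along each coordinate
eps = 1e-6
numeric = np.array([-(mse(w0 + eps * e) - mse(w0 - eps * e)) / (2 * eps)
                    for e in np.eye(4)])

print(np.max(np.abs(analytic - numeric)))   # the two directions agree closely
```

The constant factor of 2 is absorbed into the learning rate eta in practice.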
In [64]:
# allow the user to specify the algorithm used to solve the binary case
class MseLogisticRegression:
    def __init__(self, eta, iterations=20, 
                 C=0.0001, 
                 solver=MSEVectorBinaryLogisticRegression):
        self.eta = eta
        self.iters = iterations
        self.C = C
        self.solver = solver
        self.classifiers_ = []
        # internally we will store the weights as self.w_ to keep with sklearn conventions
    
    def __str__(self):
        if(hasattr(self,'w_')):
            return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_) # if we have trained the object
        else:
            return 'Untrained MultiClass Logistic Regression Object'
        
    def fit(self,X,y):
        num_samples, num_features = X.shape
        self.unique_ = np.sort(np.unique(y)) # get each unique class value
        num_unique_classes = len(self.unique_)
        self.classifiers_ = []
        for i,yval in enumerate(self.unique_): # for each unique value
            y_binary = np.array(y==yval).astype(int) # create a binary problem
            # train the binary classifier for this class
            
            hblr = self.solver(eta=self.eta,iterations=self.iters,C=self.C)
            hblr.fit(X,y_binary)

            # add the trained classifier to the list
            self.classifiers_.append(hblr)
            
        # save all the weights into one matrix, separate column for each class
        self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
        
    def predict_proba(self,X):
        probs = []
        for hblr in self.classifiers_:
            probs.append(hblr.predict_proba(X).reshape((len(X),1))) # get probability for each classifier
        
        return np.hstack(probs) # make into single matrix
    
    def predict(self,X):
        return np.argmax(self.predict_proba(X),axis=1) # take argmax along row
In [66]:
%%time
lr = MseLogisticRegression(eta=1, iterations=500, C=0.001, solver=BFGSBinaryLogisticRegression)
lr.fit(X,y)

yhat = lr.predict(X)
print('Accuracy of: ',mt.accuracy_score(y,yhat))
Accuracy of:  0.7801459689675229
CPU times: total: 2.69 s
Wall time: 2.4 s

We tried the BFGS model and did not get a promising accuracy, so we implemented option 1, techniques for logistic regression using mean squared error, and achieved a better accuracy.

Code used in this notebook was adapted from Professor Larson's notebooks for this class.

We appreciate and acknowledge Professor Larson's base model.